TO DO List

  • [ ] Define functions for library imports, data import setup and other EDA steps already finalized

Library imports


In [1]:
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import plotly.express as px
import scipy.stats as _stats
import datetime as dt
import itertools

from matplotlib import pyplot as plt
from IPython.display import display, HTML
from helper_functions import *
In [2]:
def jupyter_settings():
    %matplotlib inline
    
    plt.style.use( 'bmh' )
    plt.rcParams['figure.figsize'] = [20, 12]
    plt.rcParams['font.size'] = 24
    
    display( HTML( '<style>.container { width:100% !important; }</style>') )
    
    pd.options.display.max_columns = 200
    pd.options.display.max_rows = 200
    pd.set_option( 'display.expand_frame_repr', False )
    pd.set_option('display.float_format', lambda x: '%.2f' % x)
    
    sns.set()
    
    #warnings.filterwarnings("ignore")
In [3]:
jupyter_settings()

Data Import and Setup


Import data


In [4]:
orig_df = pd.read_csv('data/kc_house_data.csv')
house_df_eda = orig_df.copy(deep=True)

General Info


In [5]:
house_df_eda.head()
Out[5]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 7129300520 20141013T000000 221900.00 3 1.00 1180 5650 1.00 0 0 3 7 1180 0 1955 0 98178 47.51 -122.26 1340 5650
1 6414100192 20141209T000000 538000.00 3 2.25 2570 7242 2.00 0 0 3 7 2170 400 1951 1991 98125 47.72 -122.32 1690 7639
2 5631500400 20150225T000000 180000.00 2 1.00 770 10000 1.00 0 0 3 6 770 0 1933 0 98028 47.74 -122.23 2720 8062
3 2487200875 20141209T000000 604000.00 4 3.00 1960 5000 1.00 0 0 5 7 1050 910 1965 0 98136 47.52 -122.39 1360 5000
4 1954400510 20150218T000000 510000.00 3 2.00 1680 8080 1.00 0 0 3 8 1680 0 1987 0 98074 47.62 -122.05 1800 7503
In [6]:
house_df_eda.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long           21613 non-null  float64
 19  sqft_living15  21613 non-null  int64  
 20  sqft_lot15     21613 non-null  int64  
dtypes: float64(5), int64(15), object(1)
memory usage: 3.5+ MB
In [7]:
house_df_eda.describe().T
Out[7]:
count mean std min 25% 50% 75% max
id 21613.00 4580301520.86 2876565571.31 1000102.00 2123049194.00 3904930410.00 7308900445.00 9900000190.00
price 21613.00 540088.14 367127.20 75000.00 321950.00 450000.00 645000.00 7700000.00
bedrooms 21613.00 3.37 0.93 0.00 3.00 3.00 4.00 33.00
bathrooms 21613.00 2.11 0.77 0.00 1.75 2.25 2.50 8.00
sqft_living 21613.00 2079.90 918.44 290.00 1427.00 1910.00 2550.00 13540.00
sqft_lot 21613.00 15106.97 41420.51 520.00 5040.00 7618.00 10688.00 1651359.00
floors 21613.00 1.49 0.54 1.00 1.00 1.50 2.00 3.50
waterfront 21613.00 0.01 0.09 0.00 0.00 0.00 0.00 1.00
view 21613.00 0.23 0.77 0.00 0.00 0.00 0.00 4.00
condition 21613.00 3.41 0.65 1.00 3.00 3.00 4.00 5.00
grade 21613.00 7.66 1.18 1.00 7.00 7.00 8.00 13.00
sqft_above 21613.00 1788.39 828.09 290.00 1190.00 1560.00 2210.00 9410.00
sqft_basement 21613.00 291.51 442.58 0.00 0.00 0.00 560.00 4820.00
yr_built 21613.00 1971.01 29.37 1900.00 1951.00 1975.00 1997.00 2015.00
yr_renovated 21613.00 84.40 401.68 0.00 0.00 0.00 0.00 2015.00
zipcode 21613.00 98077.94 53.51 98001.00 98033.00 98065.00 98118.00 98199.00
lat 21613.00 47.56 0.14 47.16 47.47 47.57 47.68 47.78
long 21613.00 -122.21 0.14 -122.52 -122.33 -122.23 -122.12 -121.31
sqft_living15 21613.00 1986.55 685.39 399.00 1490.00 1840.00 2360.00 6210.00
sqft_lot15 21613.00 12768.46 27304.18 651.00 5100.00 7620.00 10083.00 871200.00

1. Exploratory Data Analysis


1.1 Data Dimensionality


It is a small-sized dataset, so it will not suffer from the curse of dimensionality.

In [8]:
print('Number of Rows: {}'.format(house_df_eda.shape[0]))
print('Number of Columns: {}'.format(house_df_eda.shape[1]))
Number of Rows: 21613
Number of Columns: 21

1.2 Attribute Names and Descriptions


id - a unique identifier for each house

date - the date the house was sold

price - the price at which the house was sold

bedrooms - number of bedrooms in the house

bathrooms - number of bathrooms in the house, where a fraction such as 0.25 represents a partial bathroom (e.g. only a sink, shower or toilet)

sqft_living - square footage of the interior living space

sqft_lot - size of the lot in square feet

floors - number of floors in the house

waterfront - whether the house has a waterfront view

view - an index from 0 to 4 of how good the view from the property is

condition - the house's state of preservation

grade - a rank from 1 to 13 rating the construction quality

sqft_above - square footage of the house above ground level

sqft_basement - square footage of the house below ground level

yr_built - the year the house was originally built

yr_renovated - the year of the house's last renovation (0 if never renovated)

zipcode - the zipcode area the house is in

lat - latitude

long - longitude

sqft_living15 - square footage of interior living space for the nearest 15 neighbors (possibly)

sqft_lot15 - square footage of the land lots of the nearest 15 neighbors


1.3 Attribute Types and Data Type Conversions


In [9]:
house_df_eda.dtypes
Out[9]:
id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object

Convert the date type

Converting date to a proper datetime type makes it easier to handle as a temporal variable.

In [10]:
house_df_eda.date = pd.to_datetime(house_df_eda.date)
In [11]:
house_df_eda.date[0]
Out[11]:
Timestamp('2014-10-13 00:00:00')
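Since the raw strings follow the fixed pattern 20141013T000000, an explicit format string can optionally be passed to pd.to_datetime, which is stricter and faster than letting pandas infer the format (a sketch on a toy Series, not what the cell above does):

```python
import pandas as pd

raw = pd.Series(['20141013T000000', '20141209T000000'])
# %Y%m%dT%H%M%S matches the dataset's compact timestamp pattern
parsed = pd.to_datetime(raw, format='%Y%m%dT%H%M%S')
```

An explicit format also fails loudly on a malformed row instead of silently guessing.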

1.4 Attributes Descriptive Statistics (Summary Statistics)


In [12]:
num_attributes = house_df_eda.select_dtypes(include=['int64', 'float64'])
In [13]:
num_attributes.columns
Out[13]:
Index(['id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode', 'lat', 'long',
       'sqft_living15', 'sqft_lot15'],
      dtype='object')
In [14]:
mean = pd.DataFrame(num_attributes.mean())
std = pd.DataFrame(num_attributes.std())
median = pd.DataFrame(num_attributes.median())

q1,q3 = pd.DataFrame(num_attributes.quantile([0.25])), pd.DataFrame(num_attributes.quantile([0.75]))

maximum = pd.DataFrame(num_attributes.max())
minimum = pd.DataFrame(num_attributes.min())

skew = pd.DataFrame(num_attributes.skew())
kurtosis = pd.DataFrame(num_attributes.kurtosis())

descriptive_statistics = pd.concat([mean,std,minimum,q1.transpose(),median,q3.transpose(),maximum,skew,kurtosis],axis=1)
descriptive_statistics.columns = ['mean','std','minimum','q1','median','q3','maximum','skew','kurtosis']
descriptive_statistics
Out[14]:
mean std minimum q1 median q3 maximum skew kurtosis
id 4580301520.86 2876565571.31 1000102.00 2123049194.00 3904930410.00 7308900445.00 9900000190.00 0.24 -1.26
price 540088.14 367127.20 75000.00 321950.00 450000.00 645000.00 7700000.00 4.02 34.59
bedrooms 3.37 0.93 0.00 3.00 3.00 4.00 33.00 1.97 49.06
bathrooms 2.11 0.77 0.00 1.75 2.25 2.50 8.00 0.51 1.28
sqft_living 2079.90 918.44 290.00 1427.00 1910.00 2550.00 13540.00 1.47 5.24
sqft_lot 15106.97 41420.51 520.00 5040.00 7618.00 10688.00 1651359.00 13.06 285.08
floors 1.49 0.54 1.00 1.00 1.50 2.00 3.50 0.62 -0.48
waterfront 0.01 0.09 0.00 0.00 0.00 0.00 1.00 11.39 127.63
view 0.23 0.77 0.00 0.00 0.00 0.00 4.00 3.40 10.89
condition 3.41 0.65 1.00 3.00 3.00 4.00 5.00 1.03 0.53
grade 7.66 1.18 1.00 7.00 7.00 8.00 13.00 0.77 1.19
sqft_above 1788.39 828.09 290.00 1190.00 1560.00 2210.00 9410.00 1.45 3.40
sqft_basement 291.51 442.58 0.00 0.00 0.00 560.00 4820.00 1.58 2.72
yr_built 1971.01 29.37 1900.00 1951.00 1975.00 1997.00 2015.00 -0.47 -0.66
yr_renovated 84.40 401.68 0.00 0.00 0.00 0.00 2015.00 4.55 18.70
zipcode 98077.94 53.51 98001.00 98033.00 98065.00 98118.00 98199.00 0.41 -0.85
lat 47.56 0.14 47.16 47.47 47.57 47.68 47.78 -0.49 -0.68
long -122.21 0.14 -122.52 -122.33 -122.23 -122.12 -121.31 0.89 1.05
sqft_living15 1986.55 685.39 399.00 1490.00 1840.00 2360.00 6210.00 1.11 1.60
sqft_lot15 12768.46 27304.18 651.00 5100.00 7620.00 10083.00 871200.00 9.51 150.76
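For reference, the same table can be built more compactly with DataFrame.agg. This is an optional sketch on a toy frame: q1/q3 are helper names introduced here, and num stands in for num_attributes.

```python
import pandas as pd

# Named helpers so the quantile aggregations get readable labels in the result
def q1(s): return s.quantile(0.25)
def q3(s): return s.quantile(0.75)

num = pd.DataFrame({'price': [100.0, 200.0, 300.0, 400.0]})  # toy stand-in for num_attributes
descriptive = num.agg(['mean', 'std', 'min', q1, 'median', q3,
                       'max', 'skew', 'kurtosis']).T
descriptive.columns = ['mean', 'std', 'minimum', 'q1', 'median', 'q3',
                       'maximum', 'skew', 'kurtosis']
```

This avoids building and concatenating nine intermediate DataFrames by hand.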

1.5 Identify Missing Values and Duplicated Records


Missing Values

In [15]:
num_attributes.isnull().sum()
Out[15]:
id               0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
grade            0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
zipcode          0
lat              0
long             0
sqft_living15    0
sqft_lot15       0
dtype: int64
In [16]:
cat_attributes = house_df_eda.select_dtypes(exclude=['int64','float64'])

pd.isna(cat_attributes).sum()
Out[16]:
date    0
dtype: int64

Check duplicates (houses sold two or more times)

In [18]:
len(house_df_eda['id'].unique())
Out[18]:
21436
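Since there are 21613 rows but only 21436 unique ids, 177 sales are repeats of an earlier listing. The repeat counts can also be read off directly with Series.duplicated, shown here on a toy Series (a sketch, not the notebook's cell):

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])        # toy ids with repeats
repeats = s.duplicated().sum()            # occurrences beyond the first of each value
any_dup = s.duplicated(keep=False).sum()  # rows involved in any duplication
```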
In [21]:
# Flag every row whose id appears more than once (vectorized, instead of a quadratic scan)
duplicated_ids = house_df_eda['id'].duplicated(keep=False)

Drop the duplicates but keep the most recent sale, because for this business purpose what matters are the houses currently on the market or in the real estate portfolio.

In [22]:
house_df_eda.loc[duplicated_ids,['id','date','price', 'yr_renovated','zipcode']].drop_duplicates(subset=['id'], keep='last')
Out[22]:
id date price yr_renovated zipcode
94 6021501535 2014-12-23 700000.00 0 98117
314 4139480200 2014-12-09 1400000.00 0 98006
325 7520000520 2015-03-11 240500.00 1984 98146
346 3969300030 2014-12-29 239900.00 0 98178
372 2231500030 2015-03-24 530000.00 0 98133
718 8820903380 2015-01-02 730000.00 1990 98125
824 726049190 2015-02-18 431000.00 0 98133
837 8682262400 2015-05-13 419950.00 0 98053
1086 9834200885 2015-04-20 550000.00 0 98144
1129 8062900070 2015-02-13 369000.00 0 98056
1203 1231000510 2015-05-04 510000.00 0 98118
1235 6117501820 2015-04-28 435000.00 0 98166
1451 2228900270 2015-02-12 302000.00 0 98133
1465 3630120700 2015-01-07 765000.00 0 98029
1577 7888000390 2015-04-01 235000.00 0 98198
1865 2143700830 2015-03-12 370000.00 0 98055
2039 3395040550 2015-04-29 320000.00 0 98108
2127 4345000510 2015-04-28 325000.00 0 98030
2494 9407110710 2015-02-26 322000.00 0 98045
2497 1000102 2015-04-22 300000.00 0 98002
2503 1781500435 2015-02-25 575000.00 0 98126
2532 1232000810 2015-03-26 537000.00 0 98117
2565 7856400240 2015-02-11 1650000.00 0 98006
2632 7983000200 2015-02-25 250000.00 0 98003
2977 9136103130 2015-05-12 685000.00 0 98103
3034 5083000375 2015-03-19 235000.00 0 98198
3299 7409700215 2015-03-13 921500.00 0 98115
3541 1823049202 2015-01-07 326000.00 0 98146
3624 2767602141 2014-12-22 650000.00 0 98107
3757 1254200015 2015-04-08 625000.00 0 98117
3786 723049156 2014-11-12 284700.00 0 98146
3794 4435000705 2015-03-09 255500.00 0 98188
3879 4364700600 2015-03-30 390000.00 0 98126
3951 1825069031 2014-10-16 550000.00 1980 98074
4078 3262300940 2015-02-10 940000.00 0 98039
4343 9353300600 2015-03-26 370000.00 0 98059
4873 1524079093 2015-03-18 369500.00 0 98024
4923 9809000020 2015-03-13 1940000.00 0 98004
5245 3303000130 2015-04-28 520000.00 0 98177
5252 7524400250 2014-11-24 589950.00 0 98052
5273 4204400098 2015-04-21 385000.00 0 98055
5341 6632900574 2015-02-25 595000.00 0 98155
5593 2726049071 2015-04-08 489950.00 0 98125
5695 3578401060 2015-05-04 625000.00 0 98074
5724 1788800630 2015-02-25 185000.00 0 98023
5757 6381500170 2015-01-16 365000.00 0 98125
5759 1954420170 2014-11-13 580000.00 0 98074
5973 1545800290 2015-04-08 315000.00 0 98038
6346 9828200460 2015-01-06 430000.00 0 98122
6372 9222400605 2015-04-11 850000.00 0 98115
6435 8832900780 2015-04-08 647500.00 0 98028
6534 8645530010 2015-03-25 295000.00 0 98058
6631 3323059027 2015-02-25 340000.00 0 98058
6720 5417600130 2015-05-12 301000.00 0 98065
6790 8651510380 2014-12-16 539000.00 0 98074
6903 5249801440 2015-04-22 570000.00 0 98118
7179 526059224 2015-02-06 470000.00 0 98011
7246 1446403850 2015-01-14 212000.00 0 98168
7793 251300110 2015-01-14 358000.00 0 98003
7846 4139420590 2014-08-27 1200000.00 0 98006
7848 2621600015 2015-04-30 175000.00 0 98030
8012 2206700215 2015-04-22 550000.00 0 98006
8126 2787460720 2015-05-06 259950.00 0 98031
8275 3883800011 2015-04-08 219900.00 0 98146
8341 4154300296 2015-03-18 545000.00 0 98118
8434 1721801010 2015-04-24 302100.00 1964 98146
8506 1237500540 2014-12-22 270000.00 0 98052
8533 5101405604 2015-04-28 395000.00 0 98125
8631 5127001320 2015-02-23 314950.00 0 98059
8905 5282200015 2015-01-26 840000.00 0 98115
8916 7977201065 2015-03-05 740000.00 0 98115
9017 3739500096 2015-05-05 430000.00 0 98155
9115 1568100300 2015-01-21 682500.00 0 98155
9121 1922059278 2015-03-05 255000.00 0 98030
9235 302000375 2015-05-06 250000.00 0 98001
9276 722039087 2015-05-04 329000.00 0 98070
9278 8129700644 2015-04-24 780000.00 0 98103
9280 4443800385 2015-05-06 778100.00 0 98117
9393 2212200100 2015-04-22 344900.00 0 98031
9440 4302201085 2015-05-06 546940.00 0 98106
9490 5054800110 2015-02-13 328000.00 0 98055
9507 8103000110 2015-02-05 490000.00 0 98146
9721 2568300045 2015-03-19 649950.00 0 98125
9821 3598600049 2015-04-24 224000.00 0 98168
9877 1219000473 2015-03-23 371000.00 0 98166
10222 3293700496 2014-12-04 450000.00 0 98133
10227 8945100320 2014-10-08 224097.00 0 98023
10254 2422049104 2014-12-30 235000.00 0 98032
10258 1995200200 2014-10-08 415000.00 0 98115
10273 1450100390 2015-03-16 208000.00 0 98002
10320 6623400187 2015-02-20 365000.00 0 98055
10611 2767603612 2015-01-13 489000.00 0 98107
10917 7227800055 2014-11-24 247000.00 0 98056
10970 3271300955 2015-02-24 868000.00 0 98199
11062 6141100320 2015-02-13 570000.00 0 98133
11195 1139600270 2015-03-24 310000.00 0 98023
11203 8161020060 2015-04-14 471000.00 0 98014
11288 5101402435 2015-03-04 539000.00 0 98115
11363 7701960990 2014-08-19 870000.00 0 98077
11434 109200390 2014-10-20 250000.00 0 98023
11459 4305200070 2014-09-22 561000.00 0 98007
11587 8807810890 2014-11-05 385000.00 0 98053
11776 2892700041 2015-01-28 238000.00 0 98055
12031 7230400400 2015-03-26 415500.00 0 98059
12066 1036400200 2015-04-29 697000.00 0 98052
12122 7520000695 2015-04-21 251000.00 0 98146
12209 4031000520 2014-11-25 227000.00 0 98001
12333 5430300171 2015-01-29 615500.00 0 98115
12339 5132000140 2015-01-20 415000.00 0 98106
12378 7899800045 2014-12-02 232900.00 0 98106
12389 3935900232 2015-01-12 237000.00 0 98125
12418 123039336 2014-12-08 244900.00 0 98106
12435 1217000340 2015-02-19 340000.00 0 98166
12833 9238500040 2015-02-10 599000.00 0 98072
12921 8910500150 2015-01-20 539000.00 0 98133
12956 2019200220 2015-02-26 269000.00 0 98003
13024 9211500620 2015-04-28 305000.00 0 98023
13184 1523049207 2015-01-14 220000.00 0 98168
13299 7167000040 2015-03-05 700000.00 0 98010
13618 3185600040 2014-12-24 310000.00 0 98055
13629 2023049218 2015-03-16 445000.00 0 98148
13660 3432501415 2014-11-11 399000.00 0 98155
13693 937000330 2015-03-19 246500.00 0 98198
13735 5332200530 2015-04-24 1015000.00 0 98112
13757 1788900230 2015-04-03 199950.00 0 98023
14221 1250201165 2015-03-17 474500.00 0 98144
14308 3528000040 2015-03-26 1800000.00 0 98053
14367 3333002450 2015-01-22 490000.00 0 98118
14485 3523069060 2015-04-01 415000.00 0 98038
14575 1432400120 2015-05-08 255000.00 0 98058
14856 324000530 2015-03-23 459000.00 0 98116
14983 6308000010 2015-04-23 585000.00 0 98006
14994 4222310010 2015-04-20 267950.00 0 98003
15002 7504021310 2014-12-04 745000.00 0 98074
15147 7961500010 2015-03-04 520000.00 0 98178
15157 6669020290 2015-03-04 279950.00 0 98032
15200 9834200305 2015-02-10 615000.00 0 98144
15278 2619920170 2014-12-19 765000.00 0 98033
15300 705730280 2015-04-21 335000.00 0 98038
15366 7972000010 2014-10-21 195000.00 0 98023
15469 4202400078 2015-04-28 335000.00 0 98055
15569 2560801222 2014-11-13 309950.00 0 98198
15608 3904100089 2015-03-18 300000.00 0 98118
15657 7893805650 2015-03-13 475000.00 0 98198
15802 2044500213 2015-01-26 449000.00 0 98125
15811 7129304540 2015-05-14 440000.00 0 98118
16001 2561340020 2015-02-17 500000.00 0 98074
16013 5536100020 2015-05-12 1190000.00 0 98004
16672 9250900104 2015-04-10 496000.00 0 98133
16815 7200179 2015-04-24 175000.00 0 98055
17067 8651402750 2015-04-30 219950.00 0 98042
17134 6143000020 2015-04-06 299000.00 0 98001
17205 4139440480 2014-12-01 796500.00 0 98006
17281 2473380920 2015-02-27 339000.00 0 98058
17291 1974300020 2015-02-18 624900.00 0 98034
17382 641900050 2015-02-06 499950.00 0 98133
17483 7387500235 2015-03-17 363000.00 0 98106
17581 1423049019 2015-03-31 220000.00 0 98178
17604 795000620 2015-03-11 157000.00 0 98168
17764 1630700361 2015-04-09 583500.00 0 98077
17852 6791200120 2015-04-07 515000.00 0 98075
18234 6751300375 2014-10-16 522500.00 0 98007
18565 7657000540 2015-03-04 260000.00 0 98178
18704 3558900590 2015-03-24 692500.00 0 98034
18706 8121100395 2015-03-11 645000.00 0 98118
18763 3332000615 2015-04-22 389000.00 0 98118
18992 7856400300 2015-03-22 1505000.00 0 98006
19210 1901600090 2015-04-26 390000.00 0 98166
19336 6021500970 2015-04-07 874950.00 0 98117
19553 643300040 2015-03-13 719521.00 0 98006
20054 8648900110 2014-08-26 555000.00 0 98027
20181 7853400250 2015-02-19 645000.00 0 98065
20613 2724049222 2014-12-01 220000.00 0 98118
20670 8564860270 2015-03-30 502000.00 0 98045
20780 6300000226 2015-05-04 380000.00 0 98133
21581 7853420110 2015-05-04 625000.00 0 98065
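One caveat with the cell above: drop_duplicates(keep='last') keeps the last row in row order, not necessarily the latest sale date. Sorting by date first makes that guarantee explicit; a sketch with toy data:

```python
import pandas as pd

sales = pd.DataFrame({
    'id':    [1, 1, 2],
    'date':  pd.to_datetime(['2015-01-05', '2014-06-01', '2014-07-10']),
    'price': [700000.0, 650000.0, 300000.0],
})

# Sort chronologically, then keep the last (i.e. most recent) sale per house id
latest = (sales.sort_values('date')
               .drop_duplicates(subset='id', keep='last')
               .sort_values('id'))
```

Here the first row of id 1 is the later sale, so row order alone would have kept the wrong one.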

1.6 Data Granularity


Geographic Data

The geographic attributes latitude and longitude are already at their finest granularity. Zipcode, although already in its purest form as a value, could be transformed into finer variables that locate the house more precisely (for example, the street) or into coarser ones (for example, the county).
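For example, a coarser location variable could be derived by truncating the zipcode; the 4-digit prefix used here is only an illustrative grouping of nearby zip areas, not an official county mapping:

```python
import pandas as pd

zips = pd.DataFrame({'zipcode': [98178, 98125, 98028]})
# Hypothetical coarser grouping: first 4 digits of the 5-digit ZIP
zips['zip_area'] = zips['zipcode'].astype(str).str[:4]
```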

Temporal Data

The attributes yr_built and yr_renovated are interval-based and nominal, respectively, and both have a year level of granularity. The date attribute can be considered interval-based with a day level of temporal granularity.
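Coarser temporal granularities can be derived from date via the .dt accessor; the column names sale_year and sale_month below are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2014-10-13', '2015-02-25'])})
df['sale_year'] = df['date'].dt.year             # year-level granularity
df['sale_month'] = df['date'].dt.to_period('M')  # month-level granularity
```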

In [23]:
house_df_eda['yr_built'].sort_values().unique()
Out[23]:
array([1900, 1901, 1902, 1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910,
       1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918, 1919, 1920, 1921,
       1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932,
       1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943,
       1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954,
       1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965,
       1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976,
       1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
       1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
       1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
       2010, 2011, 2012, 2013, 2014, 2015])
In [24]:
house_df_eda['yr_renovated'].sort_values().unique()
Out[24]:
array([   0, 1934, 1940, 1944, 1945, 1946, 1948, 1950, 1951, 1953, 1954,
       1955, 1956, 1957, 1958, 1959, 1960, 1962, 1963, 1964, 1965, 1967,
       1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978,
       1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989,
       1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015])
In [25]:
house_df_eda['date'].sort_values().unique()
Out[25]:
array(['2014-05-02T00:00:00.000000000', '2014-05-03T00:00:00.000000000',
       '2014-05-04T00:00:00.000000000', '2014-05-05T00:00:00.000000000',
       '2014-05-06T00:00:00.000000000', '2014-05-07T00:00:00.000000000',
       '2014-05-08T00:00:00.000000000', '2014-05-09T00:00:00.000000000',
       '2014-05-10T00:00:00.000000000', '2014-05-11T00:00:00.000000000',
       '2014-05-12T00:00:00.000000000', '2014-05-13T00:00:00.000000000',
       '2014-05-14T00:00:00.000000000', '2014-05-15T00:00:00.000000000',
       '2014-05-16T00:00:00.000000000', '2014-05-17T00:00:00.000000000',
       '2014-05-18T00:00:00.000000000', '2014-05-19T00:00:00.000000000',
       '2014-05-20T00:00:00.000000000', '2014-05-21T00:00:00.000000000',
       '2014-05-22T00:00:00.000000000', '2014-05-23T00:00:00.000000000',
       '2014-05-24T00:00:00.000000000', '2014-05-25T00:00:00.000000000',
       '2014-05-26T00:00:00.000000000', '2014-05-27T00:00:00.000000000',
       '2014-05-28T00:00:00.000000000', '2014-05-29T00:00:00.000000000',
       '2014-05-30T00:00:00.000000000', '2014-05-31T00:00:00.000000000',
       '2014-06-01T00:00:00.000000000', '2014-06-02T00:00:00.000000000',
       '2014-06-03T00:00:00.000000000', '2014-06-04T00:00:00.000000000',
       '2014-06-05T00:00:00.000000000', '2014-06-06T00:00:00.000000000',
       '2014-06-07T00:00:00.000000000', '2014-06-08T00:00:00.000000000',
       '2014-06-09T00:00:00.000000000', '2014-06-10T00:00:00.000000000',
       '2014-06-11T00:00:00.000000000', '2014-06-12T00:00:00.000000000',
       '2014-06-13T00:00:00.000000000', '2014-06-14T00:00:00.000000000',
       '2014-06-15T00:00:00.000000000', '2014-06-16T00:00:00.000000000',
       '2014-06-17T00:00:00.000000000', '2014-06-18T00:00:00.000000000',
       '2014-06-19T00:00:00.000000000', '2014-06-20T00:00:00.000000000',
       '2014-06-21T00:00:00.000000000', '2014-06-22T00:00:00.000000000',
       '2014-06-23T00:00:00.000000000', '2014-06-24T00:00:00.000000000',
       '2014-06-25T00:00:00.000000000', '2014-06-26T00:00:00.000000000',
       '2014-06-27T00:00:00.000000000', '2014-06-28T00:00:00.000000000',
       '2014-06-29T00:00:00.000000000', '2014-06-30T00:00:00.000000000',
       '2014-07-01T00:00:00.000000000', '2014-07-02T00:00:00.000000000',
       '2014-07-03T00:00:00.000000000', '2014-07-04T00:00:00.000000000',
       '2014-07-05T00:00:00.000000000', '2014-07-06T00:00:00.000000000',
       '2014-07-07T00:00:00.000000000', '2014-07-08T00:00:00.000000000',
       '2014-07-09T00:00:00.000000000', '2014-07-10T00:00:00.000000000',
       '2014-07-11T00:00:00.000000000', '2014-07-12T00:00:00.000000000',
       '2014-07-13T00:00:00.000000000', '2014-07-14T00:00:00.000000000',
       '2014-07-15T00:00:00.000000000', '2014-07-16T00:00:00.000000000',
       '2014-07-17T00:00:00.000000000', '2014-07-18T00:00:00.000000000',
       '2014-07-19T00:00:00.000000000', '2014-07-20T00:00:00.000000000',
       '2014-07-21T00:00:00.000000000', '2014-07-22T00:00:00.000000000',
       '2014-07-23T00:00:00.000000000', '2014-07-24T00:00:00.000000000',
       '2014-07-25T00:00:00.000000000', '2014-07-26T00:00:00.000000000',
       '2014-07-27T00:00:00.000000000', '2014-07-28T00:00:00.000000000',
       '2014-07-29T00:00:00.000000000', '2014-07-30T00:00:00.000000000',
       '2014-07-31T00:00:00.000000000', '2014-08-01T00:00:00.000000000',
       '2014-08-02T00:00:00.000000000', '2014-08-03T00:00:00.000000000',
       '2014-08-04T00:00:00.000000000', '2014-08-05T00:00:00.000000000',
       '2014-08-06T00:00:00.000000000', '2014-08-07T00:00:00.000000000',
       '2014-08-08T00:00:00.000000000', '2014-08-09T00:00:00.000000000',
       '2014-08-10T00:00:00.000000000', '2014-08-11T00:00:00.000000000',
       '2014-08-12T00:00:00.000000000', '2014-08-13T00:00:00.000000000',
       '2014-08-14T00:00:00.000000000', '2014-08-15T00:00:00.000000000',
       '2014-08-16T00:00:00.000000000', '2014-08-17T00:00:00.000000000',
       '2014-08-18T00:00:00.000000000', '2014-08-19T00:00:00.000000000',
       '2014-08-20T00:00:00.000000000', '2014-08-21T00:00:00.000000000',
       '2014-08-22T00:00:00.000000000', '2014-08-23T00:00:00.000000000',
       '2014-08-24T00:00:00.000000000', '2014-08-25T00:00:00.000000000',
       '2014-08-26T00:00:00.000000000', '2014-08-27T00:00:00.000000000',
       '2014-08-28T00:00:00.000000000', '2014-08-29T00:00:00.000000000',
       '2014-08-30T00:00:00.000000000', '2014-08-31T00:00:00.000000000',
       '2014-09-01T00:00:00.000000000', '2014-09-02T00:00:00.000000000',
       '2014-09-03T00:00:00.000000000', '2014-09-04T00:00:00.000000000',
       '2014-09-05T00:00:00.000000000', '2014-09-06T00:00:00.000000000',
       '2014-09-07T00:00:00.000000000', '2014-09-08T00:00:00.000000000',
       '2014-09-09T00:00:00.000000000', '2014-09-10T00:00:00.000000000',
       '2014-09-11T00:00:00.000000000', '2014-09-12T00:00:00.000000000',
       '2014-09-13T00:00:00.000000000', '2014-09-14T00:00:00.000000000',
       '2014-09-15T00:00:00.000000000', '2014-09-16T00:00:00.000000000',
       '2014-09-17T00:00:00.000000000', '2014-09-18T00:00:00.000000000',
       '2014-09-19T00:00:00.000000000', '2014-09-20T00:00:00.000000000',
       '2014-09-21T00:00:00.000000000', '2014-09-22T00:00:00.000000000',
       '2014-09-23T00:00:00.000000000', '2014-09-24T00:00:00.000000000',
       '2014-09-25T00:00:00.000000000', '2014-09-26T00:00:00.000000000',
       '2014-09-27T00:00:00.000000000', '2014-09-28T00:00:00.000000000',
       '2014-09-29T00:00:00.000000000', '2014-09-30T00:00:00.000000000',
       '2014-10-01T00:00:00.000000000', '2014-10-02T00:00:00.000000000',
       '2014-10-03T00:00:00.000000000', '2014-10-04T00:00:00.000000000',
       '2014-10-05T00:00:00.000000000', '2014-10-06T00:00:00.000000000',
       '2014-10-07T00:00:00.000000000', '2014-10-08T00:00:00.000000000',
       '2014-10-09T00:00:00.000000000', '2014-10-10T00:00:00.000000000',
       '2014-10-11T00:00:00.000000000', '2014-10-12T00:00:00.000000000',
       '2014-10-13T00:00:00.000000000', '2014-10-14T00:00:00.000000000',
       '2014-10-15T00:00:00.000000000', '2014-10-16T00:00:00.000000000',
       '2014-10-17T00:00:00.000000000', '2014-10-18T00:00:00.000000000',
       '2014-10-19T00:00:00.000000000', '2014-10-20T00:00:00.000000000',
       '2014-10-21T00:00:00.000000000', '2014-10-22T00:00:00.000000000',
       '2014-10-23T00:00:00.000000000', '2014-10-24T00:00:00.000000000',
       '2014-10-25T00:00:00.000000000', '2014-10-26T00:00:00.000000000',
       '2014-10-27T00:00:00.000000000', '2014-10-28T00:00:00.000000000',
       '2014-10-29T00:00:00.000000000', '2014-10-30T00:00:00.000000000',
       '2014-10-31T00:00:00.000000000', '2014-11-01T00:00:00.000000000',
       '2014-11-02T00:00:00.000000000', '2014-11-03T00:00:00.000000000',
       '2014-11-04T00:00:00.000000000', '2014-11-05T00:00:00.000000000',
       '2014-11-06T00:00:00.000000000', '2014-11-07T00:00:00.000000000',
       '2014-11-08T00:00:00.000000000', '2014-11-09T00:00:00.000000000',
       '2014-11-10T00:00:00.000000000', '2014-11-11T00:00:00.000000000',
       '2014-11-12T00:00:00.000000000', '2014-11-13T00:00:00.000000000',
       '2014-11-14T00:00:00.000000000', '2014-11-15T00:00:00.000000000',
       '2014-11-16T00:00:00.000000000', '2014-11-17T00:00:00.000000000',
       '2014-11-18T00:00:00.000000000', '2014-11-19T00:00:00.000000000',
       '2014-11-20T00:00:00.000000000', '2014-11-21T00:00:00.000000000',
       '2014-11-22T00:00:00.000000000', '2014-11-23T00:00:00.000000000',
       '2014-11-24T00:00:00.000000000', '2014-11-25T00:00:00.000000000',
       '2014-11-26T00:00:00.000000000', '2014-11-28T00:00:00.000000000',
       '2014-11-29T00:00:00.000000000', '2014-11-30T00:00:00.000000000',
       '2014-12-01T00:00:00.000000000', '2014-12-02T00:00:00.000000000',
       '2014-12-03T00:00:00.000000000', '2014-12-04T00:00:00.000000000',
       '2014-12-05T00:00:00.000000000', '2014-12-06T00:00:00.000000000',
       '2014-12-07T00:00:00.000000000', '2014-12-08T00:00:00.000000000',
       '2014-12-09T00:00:00.000000000', '2014-12-10T00:00:00.000000000',
       '2014-12-11T00:00:00.000000000', '2014-12-12T00:00:00.000000000',
       '2014-12-13T00:00:00.000000000', '2014-12-14T00:00:00.000000000',
       '2014-12-15T00:00:00.000000000', '2014-12-16T00:00:00.000000000',
       '2014-12-17T00:00:00.000000000', '2014-12-18T00:00:00.000000000',
       '2014-12-19T00:00:00.000000000', '2014-12-20T00:00:00.000000000',
       '2014-12-21T00:00:00.000000000', '2014-12-22T00:00:00.000000000',
       '2014-12-23T00:00:00.000000000', '2014-12-24T00:00:00.000000000',
       '2014-12-26T00:00:00.000000000', '2014-12-27T00:00:00.000000000',
       '2014-12-29T00:00:00.000000000', '2014-12-30T00:00:00.000000000',
       '2014-12-31T00:00:00.000000000', '2015-01-02T00:00:00.000000000',
       '2015-01-05T00:00:00.000000000', '2015-01-06T00:00:00.000000000',
       '2015-01-07T00:00:00.000000000', '2015-01-08T00:00:00.000000000',
       '2015-01-09T00:00:00.000000000', '2015-01-10T00:00:00.000000000',
       '2015-01-12T00:00:00.000000000', '2015-01-13T00:00:00.000000000',
       '2015-01-14T00:00:00.000000000', '2015-01-15T00:00:00.000000000',
       '2015-01-16T00:00:00.000000000', '2015-01-17T00:00:00.000000000',
       '2015-01-19T00:00:00.000000000', '2015-01-20T00:00:00.000000000',
       '2015-01-21T00:00:00.000000000', '2015-01-22T00:00:00.000000000',
       '2015-01-23T00:00:00.000000000', '2015-01-24T00:00:00.000000000',
       '2015-01-25T00:00:00.000000000', '2015-01-26T00:00:00.000000000',
       '2015-01-27T00:00:00.000000000', '2015-01-28T00:00:00.000000000',
       '2015-01-29T00:00:00.000000000', '2015-01-30T00:00:00.000000000',
       '2015-01-31T00:00:00.000000000', '2015-02-01T00:00:00.000000000',
       '2015-02-02T00:00:00.000000000', '2015-02-03T00:00:00.000000000',
       '2015-02-04T00:00:00.000000000', '2015-02-05T00:00:00.000000000',
       '2015-02-06T00:00:00.000000000', '2015-02-07T00:00:00.000000000',
       '2015-02-09T00:00:00.000000000', '2015-02-10T00:00:00.000000000',
       '2015-02-11T00:00:00.000000000', '2015-02-12T00:00:00.000000000',
       '2015-02-13T00:00:00.000000000', '2015-02-14T00:00:00.000000000',
       '2015-02-15T00:00:00.000000000', '2015-02-16T00:00:00.000000000',
       '2015-02-17T00:00:00.000000000', '2015-02-18T00:00:00.000000000',
       '2015-02-19T00:00:00.000000000', '2015-02-20T00:00:00.000000000',
       '2015-02-21T00:00:00.000000000', '2015-02-22T00:00:00.000000000',
       '2015-02-23T00:00:00.000000000', '2015-02-24T00:00:00.000000000',
       '2015-02-25T00:00:00.000000000', '2015-02-26T00:00:00.000000000',
       '2015-02-27T00:00:00.000000000', '2015-02-28T00:00:00.000000000',
       '2015-03-01T00:00:00.000000000', '2015-03-02T00:00:00.000000000',
       '2015-03-03T00:00:00.000000000', '2015-03-04T00:00:00.000000000',
       '2015-03-05T00:00:00.000000000', '2015-03-06T00:00:00.000000000',
       '2015-03-07T00:00:00.000000000', '2015-03-08T00:00:00.000000000',
       '2015-03-09T00:00:00.000000000', '2015-03-10T00:00:00.000000000',
       '2015-03-11T00:00:00.000000000', '2015-03-12T00:00:00.000000000',
       '2015-03-13T00:00:00.000000000', '2015-03-14T00:00:00.000000000',
       '2015-03-15T00:00:00.000000000', '2015-03-16T00:00:00.000000000',
       '2015-03-17T00:00:00.000000000', '2015-03-18T00:00:00.000000000',
       '2015-03-19T00:00:00.000000000', '2015-03-20T00:00:00.000000000',
       '2015-03-21T00:00:00.000000000', '2015-03-22T00:00:00.000000000',
       '2015-03-23T00:00:00.000000000', '2015-03-24T00:00:00.000000000',
       '2015-03-25T00:00:00.000000000', '2015-03-26T00:00:00.000000000',
       '2015-03-27T00:00:00.000000000', '2015-03-28T00:00:00.000000000',
       '2015-03-29T00:00:00.000000000', '2015-03-30T00:00:00.000000000',
       '2015-03-31T00:00:00.000000000', '2015-04-01T00:00:00.000000000',
       '2015-04-02T00:00:00.000000000', '2015-04-03T00:00:00.000000000',
       '2015-04-04T00:00:00.000000000', '2015-04-05T00:00:00.000000000',
       '2015-04-06T00:00:00.000000000', '2015-04-07T00:00:00.000000000',
       '2015-04-08T00:00:00.000000000', '2015-04-09T00:00:00.000000000',
       '2015-04-10T00:00:00.000000000', '2015-04-11T00:00:00.000000000',
       '2015-04-12T00:00:00.000000000', '2015-04-13T00:00:00.000000000',
       '2015-04-14T00:00:00.000000000', '2015-04-15T00:00:00.000000000',
       '2015-04-16T00:00:00.000000000', '2015-04-17T00:00:00.000000000',
       '2015-04-18T00:00:00.000000000', '2015-04-19T00:00:00.000000000',
       '2015-04-20T00:00:00.000000000', '2015-04-21T00:00:00.000000000',
       '2015-04-22T00:00:00.000000000', '2015-04-23T00:00:00.000000000',
       '2015-04-24T00:00:00.000000000', '2015-04-25T00:00:00.000000000',
       '2015-04-26T00:00:00.000000000', '2015-04-27T00:00:00.000000000',
       '2015-04-28T00:00:00.000000000', '2015-04-29T00:00:00.000000000',
       '2015-04-30T00:00:00.000000000', '2015-05-01T00:00:00.000000000',
       '2015-05-02T00:00:00.000000000', '2015-05-03T00:00:00.000000000',
       '2015-05-04T00:00:00.000000000', '2015-05-05T00:00:00.000000000',
       '2015-05-06T00:00:00.000000000', '2015-05-07T00:00:00.000000000',
       '2015-05-08T00:00:00.000000000', '2015-05-09T00:00:00.000000000',
       '2015-05-10T00:00:00.000000000', '2015-05-11T00:00:00.000000000',
       '2015-05-12T00:00:00.000000000', '2015-05-13T00:00:00.000000000',
       '2015-05-14T00:00:00.000000000', '2015-05-15T00:00:00.000000000',
       '2015-05-24T00:00:00.000000000', '2015-05-27T00:00:00.000000000'],
      dtype='datetime64[ns]')

Numerical Data

All numeric variables {'id', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15'} have the finest granularity: they are not composed or discretized in any way.

1.7 Data Distribution


In [26]:
# values = num_attributes['sqft_living'].sort_values().values
# tuplex = plt.hist(values, 20, density=True)
In [27]:
# rows, cols = choose_grid(len(num_attributes.columns))

# iterate_through_data(num_attributes,num_attributes.columns,rows,cols)
Rows 7, Columns 3
Feature id
Feature price
Feature bedrooms
Feature bathrooms
Feature sqft_living
Feature sqft_lot
Feature floors
Feature waterfront
Feature view
Feature condition
Feature grade
Feature sqft_above
Feature sqft_basement
Feature yr_built
Feature yr_renovated
Feature zipcode
Feature lat
Feature long
Feature sqft_living15
Feature sqft_lot15
In [28]:
# identify_best_fit_distribution(num_attributes,'price')
<scipy.stats._continuous_distns.norm_gen object at 0x7ff74265c7c0>
(540088.1417665294, 367118.7031813723)

<scipy.stats._continuous_distns.expon_gen object at 0x7ff7426822b0>
(75000.0, 465088.1417665294)

<scipy.stats._continuous_distns.lognorm_gen object at 0x7ff742417460>
(11.772290708305558, 74999.99999999999, 5.139744617075861)


Distributions listed by goodness of fit:
............................................
  Distribution  chi_square
0         norm    10011.41
1        expon    15939.56
2      lognorm   186500.36
Out[28]:
'norm'
In [29]:
# identify_best_fit_distribution(num_attributes,'sqft_lot')
<scipy.stats._continuous_distns.norm_gen object at 0x7ff74265c7c0>
(15106.967565816869, 41419.55327262768)

<scipy.stats._continuous_distns.expon_gen object at 0x7ff7426822b0>
(520.0, 14586.967565816869)

<scipy.stats._continuous_distns.lognorm_gen object at 0x7ff742417460>
(0.9585570130585304, 343.4371825862659, 7513.086087116011)


Distributions listed by goodness of fit:
............................................
  Distribution  chi_square
0      lognorm     2868.73
1        expon     7064.00
2         norm    72229.81
Out[29]:
'lognorm'

1.8 Data Sparsity


In [30]:
# multiple_scatter_plots(house_df_eda.loc[:, house_df_eda.columns != 'id'])

Correlation Analysis

It will be important to repeat the correlation analysis with the derived measures in the second cycle.

sqft_living and sqft_above have an obvious high positive correlation.

Both sqft_living and sqft_above also have a high positive correlation with grade. So, for the renovation question, it would be interesting to check whether increasing sqft_living also increases the selling price.

In [31]:
# fig = plt.figure(figsize=[12, 12])
# corr_mtx = num_attributes.corr()
# sns.heatmap(corr_mtx, xticklabels=corr_mtx.columns, yticklabels=corr_mtx.columns, annot=True, cmap='Blues')
# plt.title('Correlation analysis')
# plt.show()
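The pairs called out above can be checked directly. Below is a minimal sketch with a small hypothetical sample (the notebook itself would use num_attributes):

```python
import pandas as pd

# hypothetical sample; the real values come from num_attributes in the notebook
sample = pd.DataFrame({
    'sqft_living': [1180, 2570, 770, 1960, 1680],
    'sqft_above':  [1180, 2170, 770, 1050, 1680],
    'grade':       [7, 7, 6, 7, 8],
})

# pairwise Pearson correlations for the three features discussed above
corr = sample[['sqft_living', 'sqft_above', 'grade']].corr()
print(corr.loc['sqft_living', 'sqft_above'])
```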

Outlier Analysis

Outliers have not been analyzed yet.

In [32]:
# Outlier identification functions

def z_score(data, k_dev):
    """Return the values lying more than k_dev standard deviations from the mean."""
    mean = np.round(data.mean(), decimals=2)
    std_dev = np.round(data.std(), decimals=2)

    z_scores = (data - mean) / std_dev
    return data[np.abs(z_scores) > k_dev]

def modified_zscore(data, thresh):
    """Return the values whose modified z-score (based on the median absolute deviation) exceeds thresh."""
    median = np.median(data)
    median_absolute_deviation = np.median(np.abs(data - median))

    modified_z_scores = 0.6745 * (data - median) / median_absolute_deviation
    return data[np.abs(modified_z_scores) > thresh]

def iqr(data, dist):
    """Return the values outside [Q1 - dist*IQR, Q3 + dist*IQR]."""
    Q1 = data.quantile(0.25)
    Q3 = data.quantile(0.75)
    IQR = Q3 - Q1

    return data[(data < (Q1 - dist * IQR)) | (data > (Q3 + dist * IQR))]
In [33]:
house_df_eda.loc[:5, ~house_df_eda.columns.isin(['id','date'])]
Out[33]:
price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
0 221900.00 3 1.00 1180 5650 1.00 0 0 3 7 1180 0 1955 0 98178 47.51 -122.26 1340 5650
1 538000.00 3 2.25 2570 7242 2.00 0 0 3 7 2170 400 1951 1991 98125 47.72 -122.32 1690 7639
2 180000.00 2 1.00 770 10000 1.00 0 0 3 6 770 0 1933 0 98028 47.74 -122.23 2720 8062
3 604000.00 4 3.00 1960 5000 1.00 0 0 5 7 1050 910 1965 0 98136 47.52 -122.39 1360 5000
4 510000.00 3 2.00 1680 8080 1.00 0 0 3 8 1680 0 1987 0 98074 47.62 -122.05 1800 7503
5 1225000.00 4 4.50 5420 101930 1.00 0 0 3 11 3890 1530 2001 0 98053 47.66 -122.00 4760 101930
In [34]:
outliers = {}
for feature in num_attributes.loc[:, ~num_attributes.columns.isin(['id','date'])].columns:
#     print(f"{feature} outliers", end="\n")
#     print(iqr(num_attributes[feature], 1.5))
    outliers[feature] = iqr(num_attributes[feature], 1.5)
In [35]:
outliers['bedrooms'].unique()
Out[35]:
array([ 1,  6,  7,  0,  8,  9, 11, 10, 33])

1.9 Identify Data Inconsistency


Some inconsistent records, or records with inconsistent values, must be removed if they are not relevant outliers, because they would bias the analysis and future model training.

In [36]:
house_df_eda_inconsistency = house_df_eda.copy(deep=True)
In [37]:
house_df_eda_inconsistency.loc[house_df_eda_inconsistency['bedrooms'] == 0]
Out[37]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
875 6306400140 2014-06-12 1095000.00 0 0.00 3064 4764 3.50 0 2 3 7 3064 0 1990 0 98102 47.64 -122.32 2360 4000
3119 3918400017 2015-02-05 380000.00 0 0.00 1470 979 3.00 0 2 3 8 1470 0 2006 0 98133 47.71 -122.36 1470 1399
3467 1453602309 2014-08-05 288000.00 0 1.50 1430 1650 3.00 0 0 3 7 1430 0 1999 0 98125 47.72 -122.29 1430 1650
4868 6896300380 2014-10-02 228000.00 0 1.00 390 5900 1.00 0 0 2 4 390 0 1953 0 98118 47.53 -122.26 2170 6000
6994 2954400190 2014-06-24 1295650.00 0 0.00 4810 28008 2.00 0 0 3 12 4810 0 1990 0 98053 47.66 -122.07 4740 35061
8477 2569500210 2014-11-17 339950.00 0 2.50 2290 8319 2.00 0 0 3 8 2290 0 1985 0 98042 47.35 -122.15 2500 8751
8484 2310060040 2014-09-25 240000.00 0 2.50 1810 5669 2.00 0 0 3 7 1810 0 2003 0 98038 47.35 -122.05 1810 5685
9773 3374500520 2015-04-29 355000.00 0 0.00 2460 8049 2.00 0 0 3 8 2460 0 1990 0 98031 47.41 -122.17 2520 8050
9854 7849202190 2014-12-23 235000.00 0 0.00 1470 4800 2.00 0 0 3 7 1470 0 1996 0 98065 47.53 -121.83 1060 7200
12653 7849202299 2015-02-18 320000.00 0 2.50 1490 7111 2.00 0 0 3 7 1490 0 1999 0 98065 47.53 -121.83 1500 4675
14423 9543000205 2015-04-13 139950.00 0 0.00 844 4269 1.00 0 0 4 7 844 0 1913 0 98001 47.28 -122.25 1380 9600
18379 1222029077 2014-10-29 265000.00 0 0.75 384 213444 1.00 0 0 3 4 384 0 2003 0 98070 47.42 -122.49 1920 224341
19452 3980300371 2014-09-26 142000.00 0 0.00 290 20875 1.00 0 0 1 1 290 0 1963 0 98024 47.53 -121.89 1620 22850
In [38]:
house_df_eda_inconsistency.loc[house_df_eda_inconsistency['bedrooms'] == 33]
Out[38]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15
15870 2402100895 2014-06-25 640000.00 33 1.75 1620 6000 1.00 0 0 5 7 1040 580 1947 0 98103 47.69 -122.33 1330 4700
In [39]:
house_df_eda_inconsistency.yr_renovated.sort_values().unique()
Out[39]:
array([   0, 1934, 1940, 1944, 1945, 1946, 1948, 1950, 1951, 1953, 1954,
       1955, 1956, 1957, 1958, 1959, 1960, 1962, 1963, 1964, 1965, 1967,
       1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978,
       1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989,
       1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015])
In [40]:
len(house_df_eda_inconsistency.loc[house_df_eda_inconsistency['yr_renovated'] == 0])
Out[40]:
20699
In [41]:
house_df_eda_inconsistency['living_area_above_and_basement'] = house_df_eda_inconsistency['sqft_living'] - (house_df_eda_inconsistency['sqft_above'] + house_df_eda_inconsistency['sqft_basement'])
In [42]:
house_df_eda_inconsistency.loc[house_df_eda_inconsistency['living_area_above_and_basement'] < 0]
Out[42]:
id date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition grade sqft_above sqft_basement yr_built yr_renovated zipcode lat long sqft_living15 sqft_lot15 living_area_above_and_basement

Inconsistent feature values

  1. bedrooms

    1. There are some houses with 0 bedrooms; these may be properties not intended for living.

    2. There is one house record with 33 bedrooms but only 1040 square feet above ground, which is almost certainly inconsistent.

  1. bathrooms
    1. Some houses have a bathrooms value with decimal places, like 1.75. This could be an inconsistency, but it is assumed that 1 unit corresponds to a complete bathroom and that each fixture (sink, shower, or toilet) counts as 0.25.
  1. yr_renovated
    1. 20699 records have yr_renovated equal to 0.
  1. sqft_living - (sqft_above + sqft_basement)

    No inconsistency was found.
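Under the fixture assumption for bathrooms, a fractional value maps directly to a fixture count. A quick sanity check (not from the original notebook):

```python
# assumption: each 0.25 of the bathrooms value corresponds to one fixture
bathrooms = 1.75
fixtures = int(round(bathrooms / 0.25))
print(fixtures)  # 7 fixtures
```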

Inconsistent records

1.10 Business Data Assumptions


It is important to make some business data assumptions in order to better analyze the data and extract interpretable insights.

  1. Region granularity

    There are differences between sub-regions in King County, so the analysis should be segmented by region, with a specific analysis for each one; a generalized analysis would dilute the correlations within each region and make insights harder to extract. This segmentation can be done at different granularities: coarsely (e.g. North, South, and the East Mountains), by zipcode, or even by street or street block derived from latitude and longitude. In the finer cases, the number of houses per region should be counted to check whether each one has enough data to be statistically significant.

  1. Bedrooms inconsistency

    Houses with 0 bedrooms may serve purposes other than living. (Confirm the consistency of all such records, analysing them one by one or with summary statistics.)

  1. Year Renovated inconsistency

    It can be assumed that houses with yr_renovated equal to 0 are simply houses that have not been renovated yet.

  1. Condition vs Grade relevance

    The condition variable seems to carry more long-term weight than grade when evaluating which houses are better to buy: grade reflects the construction quality of the infrastructure and cannot change much over time, whereas condition reflects the house's state of preservation and can be improved with maintenance work.
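The data-sufficiency check mentioned in the region-granularity assumption can be sketched as follows (hypothetical zipcode sample and threshold; the notebook would group house_df_eda):

```python
import pandas as pd

# hypothetical sample; real data would be house_df_eda['zipcode']
zipcodes = pd.Series([98178, 98178, 98125, 98178, 98125, 98028])

counts = zipcodes.value_counts()
# keep only regions with at least a minimum number of records
min_records = 2  # assumption: threshold for meaningful statistics
enough_data = counts[counts >= min_records].index.tolist()
print(enough_data)
```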

In [ ]:
 

1.12 Attributes Usefulness for the tasks


First intuition, without looking at the EDA and based only on business understanding. (On the second cycle it will be based on the EDA.)

Business: Buy and sell houses in Real Estate market, with some profit.

Business problems:

1. Which houses should be bought and for what price?

2. Once a house is bought, what is the best time period to sell it, and for what price?

3. To raise the selling price, the company can do a renovation. So what would be good renovation changes?

These tasks can be framed as a classification task after labelling each record to indicate whether the house should be bought or not. This labelling process relies on business assumptions about good profit values and empirical rules; a preliminary analysis yields the first actionable insights, which will be presented to business experts. After validating this empirical labelling process, a classification model can be trained to make the decision-making more robust and less biased by history.

Useful Attributes for business question:

1. date, price, condition, grade, zipcode
2. date, price, condition, grade, zipcode
3. date, price, bedrooms, bathrooms, sqft_living, sqft_above

1.13 Feature Selection


In [70]:
house_df_fs = house_df_eda[['id','date','price','condition','grade','zipcode','bedrooms','bathrooms','sqft_living','sqft_above','sqft_basement','sqft_lot']].copy()  # copy to avoid SettingWithCopyWarning in later assignments

1.14 Feature engineering and derivation (Identify extra useful features based on solution planning)


(Check planning document)

Useful Attributes for business question:

1. date, price, condition, grade, zipcode, plus:
    1. median_price - based on zipcode region of the selected houses
    2. percentage_value_below_median
    3. house_total_m2 - m2_living + m2_lot
    4. price/house_total_m2 - normalizing the price by the house size makes comparisons fairer and yields more informative insights.
2. From the selected houses to buy create and use:
    1. best_season - based on zipcode region of the selected houses, and its selling date
    2. selling_price - based on the price and the season
    3. profit - will result from difference between selling_price and price 
3. From the selected houses to buy create and use:
    1.
    2. 
In [71]:
house_df_fs[['m2_living','m2_above','m2_basement','m2_lot']] = house_df_fs[['sqft_living','sqft_above','sqft_basement','sqft_lot']] * 0.0929

house_df_fs.drop(['sqft_living','sqft_above','sqft_basement','sqft_lot'], axis=1, inplace=True)

house_df_fs.head()
Out[71]:
id date price condition grade zipcode bedrooms bathrooms m2_living m2_above m2_basement m2_lot
0 7129300520 2014-10-13 221900.00 3 7 98178 3 1.00 109.62 109.62 0.00 524.88
1 6414100192 2014-12-09 538000.00 3 7 98125 3 2.25 238.75 201.59 37.16 672.78
2 5631500400 2015-02-25 180000.00 3 6 98028 2 1.00 71.53 71.53 0.00 929.00
3 2487200875 2014-12-09 604000.00 5 7 98136 4 3.00 182.08 97.55 84.54 464.50
4 1954400510 2015-02-18 510000.00 3 8 98074 3 2.00 156.07 156.07 0.00 750.63
In [72]:
house_df_fs['house_total_m2'] = house_df_fs['m2_living'] + house_df_fs['m2_lot']

house_df_fs.head()
Out[72]:
id date price condition grade zipcode bedrooms bathrooms m2_living m2_above m2_basement m2_lot house_total_m2
0 7129300520 2014-10-13 221900.00 3 7 98178 3 1.00 109.62 109.62 0.00 524.88 634.51
1 6414100192 2014-12-09 538000.00 3 7 98125 3 2.25 238.75 201.59 37.16 672.78 911.53
2 5631500400 2015-02-25 180000.00 3 6 98028 2 1.00 71.53 71.53 0.00 929.00 1000.53
3 2487200875 2014-12-09 604000.00 5 7 98136 4 3.00 182.08 97.55 84.54 464.50 646.58
4 1954400510 2015-02-18 510000.00 3 8 98074 3 2.00 156.07 156.07 0.00 750.63 906.70

2. Find Main Insights

2.1 House Recommendation report (Answering Business Team questions)

2.1.1 Which houses should be bought and for what price?

2.1.1.1 Houses to buy labelling process

Business Assumptions


1. Houses priced below the regional median and in good condition can be sold for a higher price, so they are good to buy.

2. Houses priced below the regional median but in bad condition cannot be sold for a higher price, so they are not good to buy.

3. Houses priced above the regional median, regardless of condition, are not good to buy for a profit.

In [73]:
house_df_fs['price_m2'] = house_df_fs['price']/house_df_fs['house_total_m2']
In [74]:
zipcode_median = house_df_fs[['price_m2', 'zipcode']].groupby('zipcode').median().reset_index()
zipcode_median.columns = ['zipcode','median_price_m2']
In [75]:
house_df_fs = pd.merge(house_df_fs,zipcode_median,on='zipcode',how='inner')
In [76]:
house_df_fs.head()
Out[76]:
id date price condition grade zipcode bedrooms bathrooms m2_living m2_above m2_basement m2_lot house_total_m2 price_m2 median_price_m2
0 7129300520 2014-10-13 221900.00 3 7 98178 3 1.00 109.62 109.62 0.00 524.88 634.51 349.72 344.10
1 4060000240 2014-06-23 205425.00 4 6 98178 2 1.00 81.75 81.75 0.00 629.86 711.61 288.67 344.10
2 4058801670 2014-07-17 445000.00 3 8 98178 3 2.25 195.09 150.50 44.59 761.87 956.96 465.01 344.10
3 2976800796 2014-09-25 236000.00 3 7 98178 3 1.00 120.77 120.77 0.00 547.92 668.69 352.93 344.10
4 6874200960 2015-02-27 170000.00 3 6 98178 2 1.00 79.89 79.89 0.00 489.12 569.01 298.76 344.10
In [77]:
house_df_fs['condition'].value_counts()
Out[77]:
3    14031
4     5679
5     1701
2      172
1       30
Name: condition, dtype: int64
In [78]:
house_df_fs['status'] = np.where((house_df_fs['price_m2'] < house_df_fs['median_price_m2']) & (house_df_fs['condition'] >= 3),
                                 'buy', 'do not buy')
In [79]:
house_df_fs[['id','zipcode','price_m2','median_price_m2','condition','status']].head()
Out[79]:
id zipcode price_m2 median_price_m2 condition status
0 7129300520 98178 349.72 344.10 3 do not buy
1 4060000240 98178 288.67 344.10 4 buy
2 4058801670 98178 465.01 344.10 3 do not buy
3 2976800796 98178 352.93 344.10 3 do not buy
4 6874200960 98178 298.76 344.10 3 buy

2.1.1.2 Add a priority order for selected houses to buy

Order by the percentage below the regional median and then by condition
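A possible sketch of this ordering, using a hypothetical sample with the same columns house_df_fs has at this point (percentage_value_below_median is the derived-column name assumed in the planning list):

```python
import pandas as pd

# hypothetical sample mimicking the selected houses in house_df_fs
buy_df = pd.DataFrame({
    'id': [1, 2, 3],
    'price_m2': [280.0, 300.0, 250.0],
    'median_price_m2': [344.1, 344.1, 344.1],
    'condition': [4, 3, 5],
})

# percentage below the regional median: the higher, the better the bargain
buy_df['percentage_value_below_median'] = (
    (buy_df['median_price_m2'] - buy_df['price_m2']) / buy_df['median_price_m2'] * 100
)

# order by discount first, then by condition
priority = buy_df.sort_values(['percentage_value_below_median', 'condition'],
                              ascending=[False, False])
print(priority['id'].tolist())
```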

In [80]:
# percentage_value_below_median
In [ ]:
 

3. Set new business hypothesis


  1. Houses that have a water view are 20% more expensive on average.
  1. Houses built before 1955 are 50% cheaper on average.
  1. Houses without a basement are 40% bigger than houses with one.
  1. The YoY (Year over Year) growth rate of house prices is 10%.
  1. Houses with 3 bathrooms have a MoM (Month over Month) growth of 15%.
  1. Houses in the mountains are valued 10% higher in the Summer than in the Winter.
  1. Houses with sqft_living
In [ ]:
 

3.1 Test New Business Hypothesis
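As an example of how these hypotheses can be tested, the waterfront hypothesis reduces to a grouped mean comparison (hypothetical sample below; the notebook would use house_df_eda):

```python
import pandas as pd

# hypothetical sample; real values would come from house_df_eda
df = pd.DataFrame({
    'price': [500000, 600000, 1200000, 1100000],
    'waterfront': [0, 0, 1, 1],
})

# average price with and without water view, and the percentage difference
means = df.groupby('waterfront')['price'].mean()
pct_diff = (means[1] - means[0]) / means[0] * 100
print(round(pct_diff, 1))
```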


In [ ]:
 
In [ ]:
 
In [84]:
house_df_eda['renovated'] = house_df_eda['yr_renovated'].apply(lambda x: 1 if x!=0 else 0)
house_df_eda['renovated'].value_counts()
Out[84]:
0    20699
1      914
Name: renovated, dtype: int64
In [85]:
house_df_eda['region'] = house_df_eda['lat'].apply(lambda x: 'North' if x > 47.46 else 'South')

sampleNorth = house_df_eda[house_df_eda['region'] == 'North'].sample(100)
sampleSouth = house_df_eda[house_df_eda['region'] == 'South'].sample(100)

house_df_sampled = pd.concat([sampleNorth,sampleSouth])

house_df_sampled['price_cube'] = house_df_sampled['price']**1.5  # power-1.5 transform (despite the 'cube' name), used only to exaggerate price differences in marker sizes
In [86]:
data_map = house_df_sampled[['id','zipcode', 'lat', 'long', 'price','bedrooms','renovated','region','price_cube']]

# renamed from 'map' to avoid shadowing the built-in map()
fig_map = px.scatter_mapbox(data_map, lat='lat', lon='long',
                            hover_name='id',
                            hover_data=['price'],
                            size='price_cube',
                            color='region',
                            color_discrete_sequence=['red','green'],
                            size_max=30,
                            zoom=10,
                            height=300)

fig_map.update_layout(mapbox_style='open-street-map')
fig_map.update_layout(height=600, margin={'r':0, 't':0, 'l':0, 'b':0})

fig_map.show()
In [ ]: